Abstract


The classification of large dimensional data sets arising from the merging of remote sensing data 
with more traditional forms of ancillary data causes a significant computational problem. Decision 
tree classification is a popular approach to the problem. This type of classifier is characterized by
the property that samples are subjected to a sequence of decision rules before they are assigned to a
unique class. If a decision tree classifier is well designed, the result in many cases is a classification 
scheme which is accurate, flexible and computationally efficient. 

This paper provides an automated technique for effective decision tree design which relies only
on a priori statistics. This procedure utilizes a set of two dimensional canonical transforms and Bayes
table look-up decision rules. An optimal design at each node is derived based on the associated
decision table. A procedure for computing the global probability of correct classification is also
provided.

An example is given in which class statistics obtained from an actual LANDSAT scene are used 
as input to the program. The resulting decision tree design has an associated probability of correct 
classification of .76 compared to the theoretically optimum .79 probability of correct classification 
associated with a full dimensional Bayes classifier. 

Recommendations for future research are included. 
1. Introduction


The tendency in remote sensing technology is toward the merging of remote sensing information 
with collateral data to form high dimensional data sets. The classification of such data to produce 
conventional thematic maps creates problems of two kinds. First, the computational expense associated
with classification increases steeply with dimension. Second, there is the familiar fact [1,2,3] that
for a fixed number of training samples, classification accuracy can actually decline with an increase
in dimension. A conventional solution to these problems is to reduce dimensionality by means of a
feature extraction transformation, the coefficients of which are chosen so as to optimize a class
separability measure [4]. A decision rule is then applied to assign the reduced dimensional samples to the
available classes. But the feature extraction represents a compromise since the separability measure
to be optimized must take into account the overlap of each class with each other class. Hence,
classification accuracy is not always satisfactory.

This paper presents an alternative approach to dimensionality reduction and classification which
involves a procedure for the automated design of a decision tree classifier [5]. This approach to
classification will be described in more detail in section 2, but basically it is characterized by the fact that
samples are subjected to a sequence of decision rules before they are assigned to a unique class. Each
decision rule can leave ambiguity with regard to the precise class assignment of a sample. If the
ambiguity is unacceptable for a particular application it can be removed by subsequently applied decision
rules. When the structure is diagrammed to show the hierarchy among the decision rules it exhibits a
characteristic tree-like aspect - thus the rubric “decision tree classifier.”

There are numerous advantages to decision tree classification. Most importantly, the decision 
rules can be designed to be both inexpensive and effective since each rule is required to take into 
account only a small subset of the original classes and it is not required to remove all ambiguities. 

Also, there is considerable generality and flexibility associated with this type of classification. For 
instance, collateral data of a categorical nature such as soil type or political boundaries can be readily 
incorporated within the framework of a decision tree classifier. Also, it is easy to avoid situations in
which computer time is spent in removing ambiguities which are irrelevant to a particular application
such as distinguishing among confuser crops in an agricultural scene. Of course, for these benefits to
be realized it is important for decision trees to be well designed. This paper presents an automated
technique for designing effective decision tree classifiers predicated only on a priori class statistics.

The procedure relies on a set of two dimensional feature extractions and Bayes table look-up decision 
rules. Associated error matrices are computed and utilized to provide an optimal design of the de- 
cision tree at each so-called “node.” A byproduct of this procedure is a simple algorithm for com- 
puting the global probability of correct classification assuming the statistical independence of the 
decision rules. 

Section 2 provides a more precise definition of decision tree classification. Section 3 gives mathe- 
matical details on the technique for automated decision tree design. Section 4 gives an example of a 
simple application of the procedure using class statistics acquired from an actual LANDSAT scene. 
Section 5 summarizes results and discusses directions for future research. 


2. A Mathematical Description of Decision Tree Classification
The purpose of this section is to give a rigorous description of decision tree classification. 


Consider a classification problem with sample set M to be assigned to K classes indexed by a K
dimensional index set $\mathcal{A} = \{1, 2, \ldots, K\}$. Let $\Gamma_j \subset M$ be the set of samples properly associated with
the class indexed by j. Also define $\Pi = \{\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_N\}$ as a set of nonempty subsets of $\mathcal{A}$ which
satisfies conditions $\bigcup_i \mathcal{A}_i = \mathcal{A}$ and $\mathcal{A}_i \cap \mathcal{A}_{i'} = \emptyset$ if $i \neq i'$. A generalized decision rule is defined as a
transformation $f: M \to \Pi$ from the sample set M to a set $\Pi$ of disjoint subsets of $\mathcal{A}$. An element $x \in \Gamma_j \subset M$ is
considered to be correctly classified by decision rule $f$ if $f(x) = \mathcal{A}_i$ and $j \in \mathcal{A}_i$.

It is convenient to express a generalized decision rule as a pair (D, $\Pi$) where D is a set of
parameters which define the transformation from M to $\Pi$ and where $\Pi$ is an explicitly expressed set of
subsets of $\mathcal{A}$. Since a decision tree classifier will be defined as a set of generalized decision rules which
collectively satisfy certain conditions we shall refer to a pair (D, $\Pi$) as a node. Because a sample is
subjected to a given decision rule if and only if another decision rule has mapped the sample into a
certain index subset, it is important to describe the partial ordering that must exist on a set of
generalized decision rules for that set to constitute a decision tree classifier.

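Definition: Let $\eta_i = (D_i, \Pi_i)$, $\Pi_i = \{\mathcal{A}_{1,i}, \ldots, \mathcal{A}_{N_i,i}\}$ and $\eta_j = (D_j, \Pi_j)$, $\Pi_j = \{\mathcal{A}_{1,j}, \mathcal{A}_{2,j}, \ldots, \mathcal{A}_{N_j,j}\}$
be two nodes. Then $\eta_i$ is a parent node of $\eta_j$ if there exists a $k \le N_i$ such that $\bigcup_s \mathcal{A}_{s,j} = \mathcal{A}_{k,i}$.
In this case $\eta_j$ will be referred to as an offspring node of $\eta_i$.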

A set which consists of just one element will be called a unitary set and nodes whose index
subsets consist entirely of unitary sets will be called simple nodes. A node which does not have a parent
will be called a root node. To complete our terminology, a node without offspring will be called a
terminal node.

Definition: A decision tree classifier is a set of nodes satisfying the following conditions 

(a) There is just one root node 

(b) Every node which is not a root node has a single parent 

In addition to conditions (a) and (b), a complete decision tree classifier satisfies condition (c): 

(c) For every element i in the original index set $\mathcal{A}$ there exists a node $\eta = (D, \Pi)$, $\Pi = \{\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_N\}$
and a $k \le N$ such that $\mathcal{A}_k$ is a unitary set and $i \in \mathcal{A}_k$.
There are a variety of intuitively pleasing properties possessed by complete decision tree
classifiers. For instance, one can show that all terminal nodes are simple nodes. To complete the
definition we must indicate precisely how a decision tree structure is used to classify samples. Let $\{\eta_j\}$
be a set of nodes forming a decision tree with reference to a sample set M and a set of classes indexed
by index set $\mathcal{A}$. For every sample $x \in M$ there is a unique decision sequence S(x) composed of an ordered
set of nodes in $\{\eta_j\}$ with the following properties:

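(a) The first node in the sequence S(x) is the root node of $\{\eta_j\}$.

(b) If $\eta_j = (D_j, \Pi_j)$, $\Pi_j = \{\mathcal{A}_{1,j}, \mathcal{A}_{2,j}, \ldots, \mathcal{A}_{N_j,j}\}$ is the i th element in S(x), then

$\eta_g = (D_g, \Pi_g)$, $\Pi_g = \{\mathcal{A}_{1,g}, \mathcal{A}_{2,g}, \ldots, \mathcal{A}_{N_g,g}\}$

is the (i + 1) st element of the sequence if $\eta_j$ maps x onto $\mathcal{A}_{k,j}$ and $\bigcup_s \mathcal{A}_{s,g} = \mathcal{A}_{k,j}$. If no available
node satisfies this condition, the decision sequence terminates at $\eta_j$.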

In effect, S(x) is the unique set of generalized decision rules used by the decision tree $\{\eta_j\}$ to
classify $x \in M$. The index set to which x is assigned is understood to be the image of x under the
mapping defined by the terminating node in S(x). If $\{\eta_j\}$ is a complete decision tree one can show
that each sample is mapped onto a unitary index set. Hence, a complete decision tree classifier maps
samples onto unique class indexes.
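The definitions of this section can be made concrete with a small data structure. The sketch below is illustrative only and is not the program described in section 4: the names `Node`, `rule`, `subsets`, `offspring` and `classify` are hypothetical, the decision rule D is stood in for by an arbitrary Python callable, and the toy thresholds are made up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List

IndexSubset = FrozenSet[int]


@dataclass
class Node:
    """A node (D, Pi): a decision rule plus the index subsets it maps onto."""
    rule: Callable[[float], IndexSubset]   # plays the role of D (hypothetical form)
    subsets: List[IndexSubset]             # plays the role of Pi
    # Offspring nodes, keyed by the index subset of this node that they decompose.
    offspring: Dict[IndexSubset, "Node"] = field(default_factory=dict)


def classify(root: Node, x: float) -> IndexSubset:
    """Follow the decision sequence S(x) from the root to its terminating node."""
    node = root
    while True:
        subset = node.rule(x)
        child = node.offspring.get(subset)
        if child is None:          # no node decomposes this subset: S(x) ends here
            return subset
        node = child


# Toy complete tree over classes {1, 2, 3} for scalar samples.
leaf = Node(rule=lambda x: frozenset({2}) if x < 5 else frozenset({3}),
            subsets=[frozenset({2}), frozenset({3})])
root = Node(rule=lambda x: frozenset({1}) if x < 0 else frozenset({2, 3}),
            subsets=[frozenset({1}), frozenset({2, 3})],
            offspring={frozenset({2, 3}): leaf})
```

Because the toy tree is complete, every sample ends in a unitary index set; for example `classify(root, 7)` follows the root and then the offspring node before terminating.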
3. Decision Tree Design


A complete decision tree can be designed in a natural fashion by first defining a structure for
the root node. Next, for every non unitary index subset associated with the root node it is necessary
to define another node which performs a further decomposition into smaller index subsets. The
defining process continues until simple nodes are achieved and no further decomposition is possible.
At each step of the process we are confronted with the problem of decomposing a certain index set
$\mathcal{A}$ into a set of subsets $\Pi = \{\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_N\}$ and developing a computationally efficient generalized
decision rule which maps samples into $\Pi$ in a way that provides adequate classification accuracy. We
now concentrate on a solution to this problem.

Let $f: M \to \mathcal{A}$ be a decision rule which unambiguously assigns samples $x \in M$ to elements in index
set $\mathcal{A}$. Associated with $f$ is an error matrix defined as

$\mathcal{E}_f(i,j)$ = conditional probability that a sample from a class indexed by i is assigned by
decision rule $f$ to the class indexed by j.

Let $\Pi = \{\mathcal{A}_1, \mathcal{A}_2, \ldots, \mathcal{A}_N\}$ be a decomposition of $\mathcal{A}$ into disjoint subsets. The decision rule $f$
uniquely defines a generalized decision rule $f': M \to \Pi$ in the following natural way. Assume that
$f(x) = i$ and let $\mathcal{A}_{j(i)}$ be the element of $\Pi$ which contains i. Then $f'(x) = \mathcal{A}_{j(i)}$. Furthermore, given
error matrix $\mathcal{E}_f$ one can readily compute the probability of correct classification for $f'$ which is

$$P_{f'} = \sum_{i \in \mathcal{A}} a_i \sum_{k \in \mathcal{A}_{j(i)}} \mathcal{E}_f(i,k) \qquad (1)$$

where $a_i$ is the a priori class probability associated with the class indexed by i. The right side of
equation 1 represents a very simple and efficient algorithm for computing classification accuracy.
Hence, even for relatively large index sets it is possible to investigate every generalized decision rule
generated by $f$ and to choose the one associated with maximum classification accuracy. In this
fashion it is possible to generate from a decision rule $f$ with mediocre classification accuracy a
generalized decision rule $f'$ with very high classification accuracy.
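The exhaustive search described above can be sketched in a few lines. The code below is a minimal illustration under stated assumptions, not the FORTRAN program of section 4: the function names are hypothetical, the error matrix and a priori probabilities are made-up toy values stored as nested dictionaries, and the trivial decomposition into a single subset is excluded because it removes no ambiguity.

```python
from typing import Dict, Iterator, List

ErrorMatrix = Dict[int, Dict[int, float]]


def set_partitions(items: List[int]) -> Iterator[List[List[int]]]:
    """Yield every decomposition of `items` into nonempty disjoint subsets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Put `first` into each existing subset in turn ...
        for k in range(len(partial)):
            yield partial[:k] + [[first] + partial[k]] + partial[k + 1:]
        # ... or into a new subset of its own.
        yield [[first]] + partial


def grouped_accuracy(partition: List[List[int]], error: ErrorMatrix,
                     priors: Dict[int, float]) -> float:
    """Right side of equation 1: sum_i a_i * sum_{k in subset of i} E(i, k)."""
    total = 0.0
    for subset in partition:
        for i in subset:
            total += priors[i] * sum(error[i][k] for k in subset)
    return total


def best_decomposition(classes: List[int], error: ErrorMatrix,
                       priors: Dict[int, float]) -> List[List[int]]:
    """Score every decomposition with at least two subsets and keep the best."""
    candidates = [p for p in set_partitions(list(classes)) if len(p) > 1]
    return max(candidates, key=lambda p: grouped_accuracy(p, error, priors))


# Toy example: classes 2 and 3 are badly confused with each other.
error = {1: {1: 0.90, 2: 0.05, 3: 0.05},
         2: {1: 0.02, 2: 0.50, 3: 0.48},
         3: {1: 0.02, 2: 0.40, 3: 0.58}}
priors = {1: 1 / 3, 2: 1 / 3, 3: 1 / 3}
best = best_decomposition([1, 2, 3], error, priors)
```

With these toy numbers the search keeps classes 2 and 3 together and splits off class 1, which is the behavior the paragraph above describes: a mediocre rule becomes an accurate generalized rule once the confused classes share an index subset. For five classes the same enumeration visits 51 decompositions.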

For the procedure outlined above to be a feasible approach to decision tree design, two
requirements must be satisfied:

(a) a decision rule / must be available which affords a satisfactory compromise between accu- 
racy and computational efficiency. 

(b) the error matrix must be computable. 

These conditions can be met with a two dimensional linear feature extraction matched with
a Bayesian table look-up decision rule, as suggested by Mobasseri and McGillem [6]. The feature
extraction is performed by means of a canonical analysis approach as suggested by Merembeck and
Turner [7]. That is, if m is the dimension of the sample set M, we first seek an m dimensional row
vector $A_1$ which maximizes the F-ratio


$$F = \frac{A B A^{T}}{A W A^{T}} \qquad (2)$$


where B is the between class covariance matrix as defined from class mean vectors and W is the pooled
within class covariance matrix. The required $A_1$ is the eigenvector associated with the largest
eigenvalue of $W^{-1}B$. Having determined $A_1$ we seek a row vector $A_2$ in the orthogonal complement of $A_1$
which maximizes the F-ratio defined by equation 2. It, in turn, is the eigenvector associated with the
second largest eigenvalue of $W^{-1}B$. Our required two dimensional transformation matrix has $A_1$ as
the first row and $A_2$ as the second row.
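The link between the F-ratio and the eigenvectors of $W^{-1}B$ follows from setting the gradient of equation 2 to zero. The short derivation below is a sketch which assumes that B and W are symmetric and that W is nonsingular; the symbol $\lambda$ for the stationary value of F is introduced here only for illustration.

```latex
\frac{\partial F}{\partial A}
  = \frac{2\,A B\,(A W A^{T}) - 2\,A W\,(A B A^{T})}{(A W A^{T})^{2}} = 0
\;\Longrightarrow\;
A B = \lambda\, A W, \qquad \lambda = \frac{A B A^{T}}{A W A^{T}}
\;\Longrightarrow\;
W^{-1} B\, A^{T} = \lambda\, A^{T}
```

Every stationary point of F is therefore an eigenvector of $W^{-1}B$, and the value of F attained there equals the corresponding eigenvalue, so the maximum is reached at the eigenvector with the largest eigenvalue.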

The first step in designing the associated table look-up classifier is to specify a location and
dimensions of a rectangle in the transformed two dimensional feature space such that the rectangle
contains at least 99% of the probability associated with each class density function under the usual
normality assumption. Next the rectangle is divided into 256 equal area grid elements. Each element
is associated with a class index according to the maximum likelihood classification of its midpoint.
The resulting decision rule simply assigns to each transformed sample the class index of the grid
element in which it is contained. Samples which fall outside of the rectangle are assigned to the nearest
grid element.
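The construction of the table can be sketched as follows. This is an illustration under assumptions that are not taken from the paper: each class is given directly by a two dimensional mean and covariance in the transformed space, the rectangle is taken as the union of the mean plus or minus three standard deviations along each axis (a simple box that holds more than 99% of each normal density), and the 256 elements are arranged as a 16 by 16 grid. All function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# (mean, (sxx, sxy, syy)) of a class in the transformed two dimensional space
Gaussian = Tuple[Tuple[float, float], Tuple[float, float, float]]


def log_density(p, mean, cov) -> float:
    """Log of a bivariate normal density at point p."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = p[0] - mean[0], p[1] - mean[1]
    maha = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * maha - 0.5 * math.log(det) - math.log(2.0 * math.pi)


def build_table(classes: Dict[int, Gaussian], n: int = 16):
    """Return (bounds, table) for an n x n grid; n = 16 gives 256 elements."""
    # Assumption: mean +/- 3 std per axis stands in for the 99% rectangle.
    x0 = min(m[0] - 3 * math.sqrt(c[0]) for m, c in classes.values())
    x1 = max(m[0] + 3 * math.sqrt(c[0]) for m, c in classes.values())
    y0 = min(m[1] - 3 * math.sqrt(c[2]) for m, c in classes.values())
    y1 = max(m[1] + 3 * math.sqrt(c[2]) for m, c in classes.values())
    w, h = (x1 - x0) / n, (y1 - y0) / n
    table: List[List[int]] = []
    for row in range(n):
        labels = []
        for col in range(n):
            mid = (x0 + (col + 0.5) * w, y0 + (row + 0.5) * h)
            # Maximum likelihood classification of the element midpoint.
            labels.append(max(classes, key=lambda k: log_density(mid, *classes[k])))
        table.append(labels)
    return (x0, y0, w, h), table


def look_up(point, bounds, table) -> int:
    """Class index of the grid element containing `point`."""
    x0, y0, w, h = bounds
    n = len(table)
    # Clamping sends samples outside the rectangle to the nearest grid element.
    col = min(max(int((point[0] - x0) // w), 0), n - 1)
    row = min(max(int((point[1] - y0) // h), 0), n - 1)
    return table[row][col]


# Toy example with two well separated classes.
toy = {1: ((0.0, 0.0), (1.0, 0.0, 1.0)), 2: ((10.0, 0.0), (1.0, 0.0, 1.0))}
bounds, table = build_table(toy)
```

Once the table is built, classifying a sample costs two subtractions, two divisions and one array access, which is the source of the computational efficiency claimed for this decision rule.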


It remains to define how the error matrix $\mathcal{E}$ of the above defined table look-up decision rule is
computed. A given element $\mathcal{E}(i,j)$ of $\mathcal{E}$ can be obtained by summing the integrals of the transformed
normal density function associated with the i th class over each grid element indexed by j. We have
found that a good approximation can be obtained if the density functions are first represented in
each grid element by a two dimensional second order Taylor series expanded about the grid midpoint. The
Taylor series representations rather than the density functions are then integrated over the
appropriate grid elements.
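The integral of the second order Taylor series over a single grid element has a simple closed form. The expression below is a sketch under assumed notation that does not appear in the paper: $p_i$ is the transformed density of class i, subscripts x and y denote partial derivatives evaluated at the element midpoint, w and h are the width and height of the element, and u and v are offsets from the midpoint.

```latex
\int_{-h/2}^{h/2}\!\int_{-w/2}^{w/2}
  \Big[\, p_i + p_{i,x}\,u + p_{i,y}\,v
        + \tfrac{1}{2}\big( p_{i,xx}\,u^{2} + 2\,p_{i,xy}\,u v + p_{i,yy}\,v^{2} \big) \Big]\, du\, dv
  \;=\; w h \Big[\, p_i + \tfrac{1}{24}\big( w^{2}\, p_{i,xx} + h^{2}\, p_{i,yy} \big) \Big]
```

The first order and cross terms vanish because the element is symmetric about its midpoint. Under this notation, $\mathcal{E}(i,j)$ is approximated by summing the right side over all grid elements whose table entry is j.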

An important byproduct of this approach to decision tree design is a convenient method for
computing the associated global probability of correct classification under the assumption that the
decision rules employed at each node are statistically independent. To see how the computation is
performed, we first determine the conditional probability $P_i$ that a sample from a class indexed by
i is correctly classified by a complete decision tree. Let $\{\eta_g\}_{g \in R_i}$ be a set of nodes

$\eta_g = (D_g, \Pi_g)$, $\Pi_g = \{\mathcal{A}_{1,g}, \mathcal{A}_{2,g}, \ldots, \mathcal{A}_{N_g,g}\}$

such that for some $j \le N_g$, $i \in \mathcal{A}_{j,g}$. The probability that a sample from the class indexed by i is
properly classified at node $\eta_g$ is

$$P_{i,g} = \sum_{k \in \mathcal{A}_{j,g}} \mathcal{E}_g(i,k) \qquad (3)$$

where $i \in \mathcal{A}_{j,g}$ and $\mathcal{E}_g$ is the error matrix of the table look-up decision rule employed at node $\eta_g$. As
usual, let $\mathcal{A}$ be the class index set and let $\{a_i\}_{i \in \mathcal{A}}$ be a set of a priori class probabilities. Then
assuming the statistical independence of decision rules, the global probability of correct classification is

$$P = \sum_{i \in \mathcal{A}} a_i \prod_{g \in R_i} P_{i,g} \qquad (4)$$
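Equations 3 and 4 translate directly into a short computation. The sketch below is illustrative and rests on assumptions: each node is represented as a pair of its index subsets and its error matrix, the mapping `paths` (a hypothetical name) lists for each class i the nodes in $R_i$, and all numbers are made-up toy values rather than results from the paper.

```python
from typing import Dict, List, Sequence, Set, Tuple

ErrorMatrix = Dict[int, Dict[int, float]]
NodeInfo = Tuple[Sequence[Set[int]], ErrorMatrix]   # (index subsets, error matrix)


def node_accuracy(i: int, subsets: Sequence[Set[int]], error: ErrorMatrix) -> float:
    """Equation 3: probability that class i is mapped onto its own index subset."""
    own = next(s for s in subsets if i in s)
    return sum(error[i][k] for k in own)


def global_accuracy(priors: Dict[int, float],
                    paths: Dict[int, List[NodeInfo]]) -> float:
    """Equation 4: sum_i a_i * prod_{g in R_i} P_{i,g}, assuming independence."""
    total = 0.0
    for i, a_i in priors.items():
        p_i = 1.0
        for subsets, error in paths[i]:       # the nodes in R_i
            p_i *= node_accuracy(i, subsets, error)
        total += a_i * p_i
    return total


# Toy tree: the root separates class 1 from {2, 3}; one offspring node splits {2, 3}.
root = ([{1}, {2, 3}], {1: {1: 0.90, 2: 0.05, 3: 0.05},
                        2: {1: 0.02, 2: 0.50, 3: 0.48},
                        3: {1: 0.02, 2: 0.40, 3: 0.58}})
child = ([{2}, {3}], {2: {2: 0.8, 3: 0.2}, 3: {2: 0.3, 3: 0.7}})
paths = {1: [root], 2: [root, child], 3: [root, child]}
priors = {1: 0.5, 2: 0.25, 3: 0.25}
p_global = global_accuracy(priors, paths)
```

The product over nodes is where the independence assumption enters: the accuracy at the offspring node is treated as unaffected by the fact that the sample has already passed through the root.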
4. An Example


The procedure outlined in section 3 for automated decision tree design was incorporated into a 
FORTRAN program which now resides on an IBM 360/91 computer at the Goddard Space Flight 
Center. The input is a set of class mean vectors and covariance matrices. The output is a description 
of each node of a decision tree design. Each node description consists of a decomposition of an index 
set into index subsets, the coefficients of the two dimensional linear feature extraction, a decision 
table for the table look-up classifier and its associated confusion matrix. The output also includes a 
computation of the global probability of correct classification as obtained from equations 3 and 4. 

As described in section 3, the program designs the decision tree from the top down starting from the 
root node. The design terminates when each of the original class indexes appears in a unitary index 
subset. A simple flow chart for the program is included in figure 1. 

To provide an example of the application of the program, class statistics were obtained from a
LANDSAT 2 scene taken over Finney County, Kansas during May of 1975 [8]. The five classes consisted
of two types of winter wheat and three confuser crops. The class statistics were obtained from well
known sites in Finney County. The four channels are those of the Multispectral Scanner on board
the LANDSAT 2. The sizes of the training sample sets range from about one hundred to about three
hundred. The class statistics are shown in Table 1. The information in Table 1 was used as input to
the program and the resulting decision tree design is shown as a tree diagram in figure 2. Table 2
shows the part of the program output which describes the root node. Similar information about the
other nodes is also made available. Table 3 lists all possible designs for the root node and the
associated probabilities of correct classification. Design number 5 is seen to have the optimal probability
of correct classification and is employed in the tree design as shown in figure 2.

The global probability of correct classification for the decision tree shown in figure 2 was
computed to be .76. From the results of a previous study it is known that the theoretically optimal 4
dimensional Bayes decision rule provides an accuracy of .79 when applied to this problem. Hence,
for this application the more efficient and more flexible decision tree approach provides a
classification which is nearly optimum.
Table 1

Statistics of LANDSAT-2 MSS Signatures Acquired
May 1975 Over Finney County, Kansas


(1) 184 Pixels of Non-Wheat

Channel    Mean    Std.Dev.    Covariance Matrix
   1       27.7      3.6         12.7
   2       24.5      8.0         25.0     63.4
   3       75.1     20.4        -51.4   -140.7    415.5
   4       37.4     12.0        -30.8    -84.2    242.1    143.4


(2) 333 Pixels of Non-Wheat

Channel    Mean    Std.Dev.    Covariance Matrix
   1       34.7      3.6         12.7
   2       40.4      5.5         17.2     30.0
   3       47.0      5.2          8.8      9.9     27.3
   4       19.7      2.5          0.6     -1.2     10.4      6.0


(3) 324 Pixels of Non-Wheat

Channel    Mean    Std.Dev.    Covariance Matrix
   1       33.3      1.6          2.6
   2       38.5      2.7          2.6      7.2
   3       44.1      6.4          4.3      2.5     41.2
   4       18.7      3.3          1.9      0.3     19.9     11.1


(4) 106 Pixels of Winter Wheat

Channel    Mean    Std.Dev.    Covariance Matrix
   1       28.5      2.4          5.8
   2       27.5      4.0          7.4     16.2
   3       51.2      5.2         -6.0    -14.4     26.7
   4       24.0      3.0         -4.3     -8.9     14.1      9.0


(5) 127 Pixels of Winter Wheat

Channel    Mean    Std.Dev.    Covariance Matrix
   1       21.5      2.7          7.3
   2       16.7      4.2         10.3     18.0
   3       54.9      5.1          4.1      4.9     26.0
   4       29.1      2.8         -1.0     -2.8     11.4      8.1




Table 2

Information Associated with the Root Node
of Decision Tree Shown in Figure 2


 .48    .71   -.38    .02   -.06    .58


Error Matrix

 .011   .056   .045
 .473   .512   .015
 .192   .808   0
 .042   .178   .758
 0      .004   .991


Decision Table

[Table of class indexes (1 to 5), one for each of the 256 grid elements; the layout is not recoverable.]




Table 3

All Possible Configurations of the Root Node
and their Probabilities of Correct Classification


Design No.    Node Structure            Probability
     1        (1) (2,3,4,5)             0.97827
     2        (2) (1,3,4,5)             0.77264
     3        (3) (1,2,4,5)             0.75595
     4        (4) (1,2,3,5)             0.96345
     5        (5) (1,2,3,4)
     6        (1,2) (3,4,5)             0.75465
     7        (1,3) (2,4,5)             0.75341
     8        (1,4) (2,3,5)             0.95945
     9        (1,5) (2,3,4)             0.97829
    10        (2,3) (1,4,5)             0.96223
    11        (2,4) (1,3,5)             0.75343
    12        (2,5) (1,3,4)             0.76986
    13        (3,4) (1,2,5)             0.75467
    14        (3,5) (1,2,4)             0.75317
    15        (4,5) (1,2,3)             0.96343
    16        (1) (2) (3,4,5)           0.75278
    17        (1) (3) (2,4,5)           0.74382
    18        (1) (4) (2,3,5)
    19        (1) (5) (2,3,4)           0.97689
    20        (2) (3) (1,4,5)           0.74541
    21        (2) (4) (1,3,5)           0.74476
    22        (2) (5) (1,3,4)           0.76986
    23        (3) (4) (1,2,5)           0.73703
    24        (3) (5) (1,2,4)           0.75317
    25        (4) (5) (1,2,3)           0.96205
    26        (1) (2,3) (4,5)           0.95197
    27        (1) (2,4) (3,5)           0.74243
    28        (1) (2,5) (3,4)           0.75140
    29        (2) (1,3) (4,5)           0.74474
    30        (2) (1,4) (3,5)           0.74262
    31        (2) (1,5) (3,4)           0.75280
    32        (3) (1,2) (4,5)           0.73701
    33        (4) (1,2) (3,5)           0.73563
    34        (5) (1,2) (3,4)           0.75327
    35        (3) (1,4) (2,5)           0.74263
    36        (3) (1,5) (2,4)           0.74383
    37        (4) (1,3) (2,5)           0.74336
    38        (5) (1,3) (2,4)           0.75203
    39        (4) (1,5) (2,3)           0.95198
    40        (5) (1,4) (2,3)           0.95945
    41        (1) (2) (3) (4,5)         0.73514
    42        (1) (2) (4) (3,5)         0.73376
    43        (1) (2) (5) (3,4)         0.75140
    44        (1) (3) (4) (2,5)         0.73376
    45        (1) (3) (5) (2,4)         0.74243
    46        (1) (4) (5) (2,3)         0.95058
    47        (2) (3) (4) (1,5)         0.73516
    48        (2) (3) (5) (1,4)         0.74263
    49        (2) (4) (5) (1,3)         0.74336
    50        (3) (4) (5) (1,2)         0.73563
    51        (1) (2) (3) (4) (5)       0.73376







5. Summary and Recommendation for Future Research 
The classification of large dimensional data sets arising from the merging of remote sensing data 
with more traditional forms of ancillary data causes a significant computational problem. Decision 
tree classification is an increasingly popular approach to the problem. This type of classifier is
characterized by the property that samples are subjected to a sequence of decision rules before they are
assigned to a unique class. If a decision tree classifier is well designed, the result in many cases is a
classification scheme which is accurate, flexible, and computationally efficient.

It is useful to have available an automated procedure for effective decision tree design which
relies only on a priori class statistics. The procedure described in this paper utilizes a set of two
dimensional feature extractions and Bayes table look-up decision rules. An optimal design at each
node is derived based on the associated error matrix. A procedure for computing the global
probability of correct classification is also provided.

An example is provided in which class statistics obtained from an actual LANDSAT scene are 
used as input to the program. The resulting decision tree design shown in figure 2 has an associated 
probability of correct classification of .76 which compares reasonably to the theoretically optimum 
.79 probability of correct classification associated with a full dimensional Bayes classifier.

The work documented in this report represents a promising direction in the exploitation of
decision tree classification. An obvious next step is to test the procedure on large dimensional merged
data sets with results compared to ground truth information. Also, Monte Carlo studies are in order
to validate the computational procedure for determining the global probability of correct
classification as given in equations 3 and 4. This is particularly important for rather deep decision tree
structures where samples can be subjected to many decision rules before being finally classified. It is
possible in this situation that the independence assumption can lead to error.

It is also clear that the automated procedure described in section 3 should be modified to include
greater flexibility. For instance, it should be possible to permit a user to employ collateral data of a
categorical nature in defining certain node structures of a decision tree. Also, it should be possible to
ensure that a decision tree design reflects the fact that for a certain application, certain ambiguities
among classes are irrelevant. As an example, for the case presented in section 4, classes 1, 2, and 3
are confuser crops in an agricultural scene. Hence, node D as represented in figure 2 can be deleted
from the tree structure with no loss of useful information.